ABSTRACT

Item response theory (IRT) is a useful and effective tool
for item response measurement if used in the proper context. This paper
discusses the sets of assumptions under which responses can be modeled while
exploring the framework of the IRT models relative to response testing. The
one-parameter model, or one-parameter logistic model, is perhaps the simplest
of the IRT models. It presumes that only a single item parameter is necessary
to represent the item response process. This parameter is termed difficulty
and is given the symbol beta. All unidimensional IRT models operate on the
assumption that a single underlying latent construct (theta) is the chief
causal determinant of the observed responses to each of a test's items.
In the two-parameter model, the discrimination parameter, given the Greek
symbol alpha, is added. This allows the item characteristic curves (ICCs) for
different items to exhibit different slopes. The discrimination parameter
models the fact that some items have strong (or weak) associations with the
underlying construct being evaluated (theta). The three-parameter model adds
one more parameter to the two-parameter model to reflect the fact that the
lower asymptote of the ICC may need to take a nonzero value in order to
account for guessing. The paper reviews some studies involving the use of
IRT and discusses the ways in which IRT benefits research and testing.
(Contains 24 references.) (SLD) 
Item Response Theory



Introduction 

Because of the growing debate over the relative effectiveness of Item
Response Theory (IRT) and Classical Test Theory, it has become more
imperative for the proponents of IRT to demonstrate its usefulness in
education research. Originally, IRT methods were used principally with
standardized achievement and aptitude tests composed of multiple-choice
items scored dichotomously (Harvey & Hammer 1999). While item
response theory methods have existed for more than half a century,
only of late have they begun to achieve widespread attention in
psychological assessment (Harvey & Hammer 1999). The distinctiveness of
this approach is often associated with its unidimensional measurement
model. According to Harvey and Hammer (1999), “The IRT based approach to
test development has the advantage of allowing the test developer to easily
determine the effect of adding, or deleting, a given test item or set of test
items by examining the test information function and/or the standard
error function for an item pool” (p. 367). However, this same
distinctiveness has also been the basis for arguments that IRT is
ineffective. Harvey and Hammer (1999) suggest that it is important to
understand the limitations of IRT in terms of usefulness and
effectiveness. Researchers are asked to remember that when measuring, one
is fitting data to a mathematical model with certain assumptions and
limitations. Harvey and Hammer (1999) argue that there is no guarantee
that the model involved in a given IRT approach, whether one-parameter,
two-parameter, or three-parameter, will offer a sufficient fit to the data.

As mentioned before, IRT models deal primarily with standardized
tests with a dichotomous format (i.e., items scored right or wrong, or
true or false). According to Ayala and Bolesta (1999), “IRT is used in state
testing programs, such as the Maryland State Department of Education
High School Functional Assessment program and in municipal programs
such as the Portland School District, for test equating” (p. 3). It is
important to underscore that the IRT models for dichotomous items are
not limited to two-alternative multiple-choice formats; they
can be applied to multiple-choice items that have any number
of response options and even to non-multiple-choice items (Harvey &
Hammer 1999). Essentially, the key prerequisite is that each person’s
item response can be scored to yield a dichotomy.
Additionally, IRT models can handle polytomous (polychotomous)
responses. According to Ayala and Bolesta (1999), “Most IRT work has
been based on dichotomous models, however, not all examinee-item
interaction can be modeled by a dichotomous model. For example, to
capture the information in a Likert item or to assign credit for a partially
correct answer requires a polytomous model” (p. 3). Polytomous models
contain more item parameters than the dichotomous model, so larger
samples are required. For example, Ayala and Bolesta (1999) note that a
“larger ratio of examinees to item parameters was needed for Masters’
(1982) partial credit model to produce stable item and trait parameter
estimates, regardless of the number of categories” (p. 4).

Item Response Theory is a useful and effective tool for item
response measurement if used in the proper context. According to Edward
Hak-Sing (2001), “For more than a decade, the Item Response Theory
(IRT) has provided a framework under which dichotomous and
polytomous responses to items can be modeled under a specific set of
assumptions” (p. 109). This paper will discuss these specific sets of
assumptions while exploring the framework of the IRT models relative to
response testing.



One Parameter Model 

The one-parameter model, or one-parameter logistic model, is
perhaps the simplest of the IRT models. As its name implies, it presumes
that only a single item parameter is necessary to represent the item
response process (Harvey & Hammer 1999). This parameter is
termed difficulty and is given the symbol b. According to Harvey and Hammer
(1999), “operationally, it [the difficulty parameter] is defined as the
score on theta that is associated with 50% likelihood of a correct/endorsed
item response” (p. 359). All unidimensional IRT models rest on the
assumption that a single underlying latent construct (theta) is the chief
causal determinant of the observed responses to each test’s items (Harvey &
Hammer 1999). In a study conducted by Prieto, Roset, and Badia (2001),
entitled Rasch Measurement in the Assessment of Growth Hormone
Deficiency in Adult Patients, a sample of 356 adult
patients with untreated GHD was assessed repeatedly. Patients
answered the survey on occasions 12 months apart. Responses were evaluated
following the dichotomous logistic response model. Parameter
estimates, model-data fit, and separation statistics were calculated.
The invariance of the item parameters across time was tested at the
follow-up. Rasch results were furthermore employed to determine score
differences through the computation of the Reliable Change Index (p. 49).
One disadvantage of the one-parameter model is its assumption that all
items in the test share identically shaped ICCs; while this might be
realistic for some item groups, it is quite unrealistic in many practical
measurement conditions (Harvey & Hammer 1999). According to Junker
& Sijtsma (2000), “The monotonicity of item response functions is a
central feature of most parametric and nonparametric item response
models” (p. 65). Monotonicity allows items to be interpreted as
measuring a trait, and it supports a general theory of
nonparametric inference about traits (Junker & Sijtsma 2000).
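
The operational definition of difficulty quoted above can be illustrated with a short sketch. It assumes the standard logistic form of the one-parameter model with no scaling constant; the function name icc_1pl and the example difficulty value are hypothetical and do not come from the studies cited.

```python
import math

def icc_1pl(theta, b):
    """Probability of a correct/endorsed response under the
    one-parameter logistic model (assumed form, no scaling constant)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item with difficulty b = 0.5.
b = 0.5
for theta in (-2.0, 0.5, 3.0):
    print(f"theta={theta:+.1f}  P={icc_1pl(theta, b):.3f}")
```

When theta equals b the probability is exactly one half, which matches the 50% definition, and the probability rises monotonically as theta increases.
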
Two Parameter Model

In the two-parameter model, the discrimination parameter, a, is
added. This allows the ICCs for different items to exhibit different slopes.
The discrimination parameter allows us to model the fact that some
items have strong (or weak) associations with the underlying construct
being evaluated (theta); higher values denote stronger associations (Harvey
& Hammer 1999). According to Harvey & Hammer (1999), “The a
parameter is very important in IRT due to the fact that it directly
determines the amount of information provided by an item; Items with
higher a parameters provide more information regarding theta, all other
factors being equal” (p. 361). Rogers and Ndalichako (2000) illustrate the
two-parameter item response model in an analysis of 1232 high
school seniors. In their article, Number-Right, Item-Response, and
Finite-State Scoring: Robustness With Respect to Lack of Equally
Classifiable Options and Item Option Independence, they concluded
that the number-right and the one- and two-parameter scoring methods were
equally sensitive to the presence of absurd options or stem-option
connections and pairs of similar or opposite options (p. 5).
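
The link between the a parameter and item information can be made concrete with a minimal sketch. It assumes the standard two-parameter logistic form and the usual item information expression a^2 * P * (1 - P), again with no scaling constant; the function names and the parameter values are illustrative only.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: a sets the slope, b the location."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """Item information for the 2PL model: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Two hypothetical items with the same difficulty but different slopes.
for a in (0.5, 2.0):
    print(f"a={a}: information at theta=b is {info_2pl(0.0, a, 0.0):.4f}")
```

At theta equal to b the probability is one half, so the information reduces to a^2 / 4; quadrupling a therefore multiplies the information at that point by sixteen, all other factors being equal.
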

Three Parameter Model 

Even though the two-parameter model deals with one of the most
serious criticisms of the Rasch model (i.e., the assumption that all test
items are alike with regard to their discriminating power), it does not
address another potentially significant factor that may differ across items.
The three-parameter model adds one more parameter (c) to the two-parameter
model to reflect the fact that the lower asymptote of the ICC
may need to take a nonzero value in order to account for guessing
(Harvey & Hammer 1999). In a
study done by Hoskens and Boeck (2001), one can observe the function
of a three-parameter model. They state:

“The framework for modeling componential data using item
response theory models for polytomous items is presented. This
framework models response accuracies on complex cognitive tasks,
which are decomposed in terms of more basic elements, such as
knowledge structures, cognitive processes, and strategies” (p. 19).

The following graph illustrates the three-parameter model in
Item Response Theory. It shows how the first two
parameters complement the third parameter within this model. The
difficulty parameter (b, or beta) is .12, while the guessing parameter,
or third parameter (c), is set at .17. The discrimination parameter (a)
is .92, which suggests that this item, as compared to others, would be
good for testing.
[Figure: Item Response Function and Item Information. Subtest 1: random;
Item 8: 0008. a = 0.92; b = 0.12; c = 0.17]

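
The parameter values reported for this item can be plugged into the three-parameter logistic form as a rough check on what the curve should look like. This is only a sketch: it assumes the common form P = c + (1 - c) / (1 + exp(-a(theta - b))) with no scaling constant, which may differ from the convention used by the software that produced the original plot, and the function name is hypothetical.

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC with lower asymptote c
    (assumed form, no scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Parameter values reported for Item 8 in the text.
A, B, C = 0.92, 0.12, 0.17

for theta in (-3.0, B, 3.0):
    print(f"theta={theta:+.2f}  P={icc_3pl(theta, A, B, C):.3f}")
```

Under this form the curve approaches c = .17 for very low theta rather than zero, and at theta = b the probability is (1 + c) / 2 = .585 rather than .50, so b no longer marks the 50% point once guessing is modeled.
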
Articles Detailing Studies Involving IRT



There exist several studies involving the use of Item Response
Theory. A study by Bickel, Buyske, Chang, and Ying (2001) states, “An
important assumption in IRT model-based adaptive testing is that
matching difficulty levels of test items with an examinee’s ability makes a
test more efficient” (p. 69). Their study starts from the premise that,
“when adding an item to a test, the improvement of accuracy
is an increasing function of the item information, this assumption
amounts to claiming that the item information is maximized when its
difficulty level matches the examinee’s ability” (p. 69). In another study,
Ogasawara (2000) examines how asymptotic standard errors (SEs) in item
response theory are derived. According to Ogasawara (2000), “Two
variations of the item and test response function methods and SEs of
their parameter estimates are presented that use logit transformations of
the item response functions” (p. 53). Ogasawara concludes from
numerical examples that the SEs of the item and test response function
methods are similar to each other and are smaller than those of other
methods (p. 53). Another article examines
how Rasch models differ by describing the Rasch
measurement partial credit model and comparing it to other Rasch
models. Bode (1998) writes, “The calibration of instruments with
increasingly complex items is described, starting with dichotomous items
and moving on to polychotomous items using a single rating scale, and
mixed polychotomous items using multiple rating scales, and
instruments in which each item has its own rating scale” (p. 78).
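
The assumption discussed by Bickel, Buyske, Chang, and Ying, that item information is maximized when difficulty matches ability, can be explored numerically. The sketch below is not their analysis; it assumes the standard three-parameter logistic information function with no scaling constant and uses a simple grid search. The function names are hypothetical, and the parameter values are those of the item graphed earlier in this paper.

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC (reduces to the 2PL when c = 0)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information: a^2 * ((P - c) / (1 - c))^2 * (1 - P) / P."""
    p = icc_3pl(theta, a, b, c)
    return a * a * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def peak_theta(a, b, c):
    """Grid search (step 0.01 on [-4, 4]) for the theta with most information."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: info_3pl(t, a, b, c))

print("no guessing:  ", peak_theta(0.92, 0.12, 0.0))
print("with guessing:", peak_theta(0.92, 0.12, 0.17))
```

With c = 0 the peak falls at theta = b, consistent with the assumption; with a nonzero guessing parameter the peak moves somewhat above b, which is one reason the assumption deserves scrutiny under the three-parameter model.
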

Another study, which addresses the discrimination and guessing
parameters, entitled Optimal Item Discrimination and Maximum Information
for Logistic IRT Models, was done by Veerkamp and Berger (1999).
According to their article, “This study derives discrimination parameter
values, as functions of the guessing parameter and distances between
person parameters and item difficulty, that yield maximum information
for the three-parameter logistic item response theory model” (p. 31).

Reise (2000) discusses, in his article Using Multilevel Regression to
Evaluate Person-Fit IRT Models, “how multilevel logistic regression can be
used to assess the consistency of an individual’s response pattern with
an item response theory measurement model” (p. 543). In Kamata’s
(2001) article, Item Analysis by the Hierarchical Generalized Linear
Model, “the hierarchical generalized model is presented as an explicit
two-level formulation of multilevel item response model” (p. 79). Discussion
is provided that explains how the HGLM is equivalent to the Rasch
model, along with an examination of how the characteristics of the HGLM
can be expressed as a latent regression model (pp. 79-93).
Fernando, Lorenzo and Molina (2001) discuss how an item response
theory model of response stability is developed based on the local
independence principle. In their article, An Item Response Theory
Analysis of Response Stability in Personality Measurement, they
examine how “the model predicts response changes under repeated
administration of the same instrument using item and examinee
parameter estimates as predictors” (p. 3). Muraki, Hombo and Lee
(2000), in Equating and Linking of Performance Assessments,
offer an overview of linking methods applied to performance
assessments: “Major issues and recent developments in linking
performances are discussed. Three common linking designs (single
group, randomly equivalent groups, and nonequivalent groups with
anchor items) are compared” (p. 325). Segall’s (2001) article, General
Ability Measurement: An Application of Multidimensional Item Response
Theory, discusses ways to improve measurement accuracy. “One method
provides a multidimensional item response theory estimate obtained
from conventional administration of multiple choice test items, while the
other method chooses items adaptively to maximize the precision of the
general ability” (p. 79). Schulz, Kolen, and Nicewander (1999) explore a
new procedure for defining achievement levels in their article, A Rationale
for Defining Achievement Levels Using IRT-Estimated Domain Scores. “This
procedure assigns examinees to levels of achievement when the levels are
represented by separate pools of multiple choice items. Items were
assigned to levels on the basis of their content and hierarchically defined
level descriptions” (p. 347). Wolfe’s (2000) Equating and Item Banking
with the Rasch Model discusses the Rasch measurement procedures for
equating multiple test forms. “The procedures entail selecting an
appropriate data collection design, estimating parameters, transforming
the parameters from multiple forms to a common scale and evaluating
the quality of the linkage between these forms” (p. 409).
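
The step of transforming parameters from multiple forms to a common scale can be sketched for the simplest case. The example assumes a common-item (anchor) design and a mean-shift link, which is one common approach under the Rasch model and not necessarily the exact procedure Wolfe describes; all difficulty values and function names are hypothetical.

```python
def rasch_link_constant(anchor_b_base, anchor_b_new):
    """Shift that places new-form difficulties on the base-form scale,
    computed as the difference in mean anchor-item difficulty."""
    if not anchor_b_base or len(anchor_b_base) != len(anchor_b_new):
        raise ValueError("anchor lists must be non-empty and the same length")
    mean_base = sum(anchor_b_base) / len(anchor_b_base)
    mean_new = sum(anchor_b_new) / len(anchor_b_new)
    return mean_base - mean_new

def to_base_scale(difficulties, shift):
    """Apply the linking shift to every difficulty on the new form."""
    return [b + shift for b in difficulties]

# Hypothetical estimates for three anchor items calibrated on both forms.
base_anchors = [-1.0, 0.0, 1.0]
new_anchors = [-0.5, 0.5, 1.5]

shift = rasch_link_constant(base_anchors, new_anchors)
linked = to_base_scale(new_anchors + [2.0], shift)

# A simple check on link quality: anchor residuals after the shift.
residuals = [n + shift - b for n, b in zip(new_anchors, base_anchors)]
print(shift, linked, residuals)
```

Because Rasch items share a common slope, a single additive constant is enough to align the two scales in this sketch; large anchor residuals after the shift would suggest that some anchor items behave differently across forms.
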

Conclusion 

Item Response Theory has a long history as a form of
item analysis (Harvey & Hammer 1999). Only recently, however, has it gained
popularity within educational research and psychological measurement,
although it has long been widely accepted in standardized aptitude testing.
According to Harvey & Hammer (1999), “one very practical reason for this
belated popularity is the fact IRT techniques tend to be far more
computationally demanding than methods of test construction and scoring
that are based on classical test theory” (p. 353). Item Response Theory
offers quality item response analysis that benefits researchers and
psychologists alike.
“IRT benefits research and testing by the fact that it provides a much
more detailed view of item-level and test level functioning. It can be
adapted to many different kinds of tests; the score estimation process is
more precise, allowing simultaneous consideration of both the number of
right/endorsed items as well as the properties (difficulty, discrimination)
of each item” (Harvey & Hammer 1999).